Internetarchive: A Python Interface to archive.org
Internetarchive Library
Internetarchive is a Python interface to archive.org.
Usage:
>>> from internetarchive import get_item
>>> item = get_item('govlawgacode20071')
>>> item.exists
True
- copyright
2012-2019 by Internet Archive.
- license
AGPL 3, see LICENSE for more details.
internetarchive.Item
- class Item(archive_session, identifier: str, item_metadata: Mapping | None = None)[source]#
Bases:
internetarchive.item.BaseItem
This class represents an archive.org item. Generally this class should not be used directly, but rather via the internetarchive.get_item() function:
>>> from internetarchive import get_item
>>> item = get_item('stairs')
>>> print(item.metadata)
Or to modify the metadata for an item:
>>> metadata = {'title': 'The Stairs'}
>>> item.modify_metadata(metadata)
>>> print(item.metadata['title'])
'The Stairs'
This class also uses IA’s S3-like interface to upload files to an item. You need to supply your IAS3 credentials in environment variables in order to upload:
>>> item.upload('myfile.tar', access_key='Y6oUrAcCEs4sK8ey',
...             secret_key='youRSECRETKEYzZzZ')
True
You can retrieve S3 keys here: https://archive.org/account/s3.php
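The upload call above passes the keys inline. One way to keep them out of source code is to read them from the environment and pass them on as keyword arguments. The sketch below is only an illustration: the helper credentials_from_env and the variable names MY_IA_ACCESS_KEY and MY_IA_SECRET_KEY are made up for this example and are not names the library is known to read.

```python
import os


def credentials_from_env(environ=None):
    """Build upload keyword arguments from two (hypothetical) environment variables."""
    env = os.environ if environ is None else environ
    # A KeyError here means a variable is not set; fail early rather than
    # sending a request without credentials.
    return {
        "access_key": env["MY_IA_ACCESS_KEY"],
        "secret_key": env["MY_IA_SECRET_KEY"],
    }


# Intended use (needs real keys and network access, so not executed here):
#     item.upload('myfile.tar', **credentials_from_env())
```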
- dark(comment: str, priority: int | str | None = None, data: Mapping | None = None, reduced_priority: bool = False, request_kwargs: Mapping | None = None) Response [source]#
Dark the item.
- Parameters
comment – The curation comment explaining reason for darking item
priority – The task priority.
reduced_priority – Submit your task at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you still may be subject to rate-limiting. This is different from priority in that it will allow you to possibly avoid rate-limiting.
data – Additional parameters to submit with the task.
- Returns
- derive(priority: int = 0, remove_derived: str | None = None, reduced_priority: bool = False, data: MutableMapping | None = None, headers: Mapping | None = None, request_kwargs: Mapping | None = None) Response [source]#
Derive an item.
- Parameters
priority – Task priority from 10 to -10 [default: 0]
remove_derived – You can use wildcards (“globs”) to only remove some prior derivatives. For example, “*” (typed without the quotation marks) specifies that all derivatives (in the item’s top directory) are to be rebuilt. “*.mp4” specifies that all “*.mp4” derivatives are to be rebuilt. “{*.gif,*thumbs/*.jpg}” specifies that all GIF and thumbs are to be rebuilt.
reduced_priority – Submit your derive at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you still may be subject to rate-limiting.
- Returns
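To get a feel for what a simple wildcard such as “*.mp4” would select, the standard-library fnmatch module can be applied to a list of file names. This only illustrates glob matching in general: the file names are invented, and the server-side matching used by derive (including brace patterns) may differ.

```python
from fnmatch import fnmatchcase


def matching_names(names, pattern):
    """Return the names that match a shell-style wildcard pattern, case-sensitively."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Invented file names, for illustration only.
names = ["talk.mp4", "talk.ogv", "talk_thumbs.gif", "notes.txt"]
print(matching_names(names, "*.mp4"))  # ['talk.mp4']
print(matching_names(names, "*"))      # every name in the list
```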
- download(files: File | list[File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, exclude_pattern: str | None = None, dry_run: bool = False, verbose: bool = False, ignore_existing: bool = False, checksum: bool = False, destdir: str | None = None, no_directory: bool = False, retries: int | None = None, item_index: int | None = None, ignore_errors: bool = False, on_the_fly: bool = False, return_responses: bool = False, no_change_timestamp: bool = False, ignore_history_dir: bool = False, source: str | list[str] | None = None, exclude_source: str | list[str] | None = None, stdout: bool = False, params: Mapping | None = None, timeout: int | float | tuple[int, float] | None = None) list[Request | Response] [source]#
Download files from an item.
- Parameters
files – Only download files matching given file names.
formats – Only download files matching the given Formats.
glob_pattern – Only download files matching the given glob pattern.
exclude_pattern – Exclude files whose filename matches the given glob pattern.
dry_run – Output download URLs to stdout, don’t download anything.
verbose – Turn on verbose output.
ignore_existing – Skip files that already exist locally.
checksum – Skip downloading file based on checksum.
destdir – The directory to download files to.
no_directory – Download files to current working directory rather than creating an item directory.
retries – The number of times to retry on failed requests.
item_index – The index of the item for displaying progress in bulk downloads.
ignore_errors – Don’t fail if a single file fails to download, continue to download other files.
on_the_fly – Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
return_responses – Rather than downloading files to disk, return a list of response objects.
no_change_timestamp – If True, leave the time stamp as the current time instead of changing it to that given in the original archive.
source – Filter files based on their source value in files.xml (i.e. original, derivative, metadata).
exclude_source – Filter files based on their source value in files.xml (i.e. original, derivative, metadata).
params – URL parameters to send with download request (e.g. cnt=0).
ignore_history_dir – Do not download any files from the history dir. This param defaults to False.
- Returns
True if all files have been downloaded successfully.
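As a sketch of how several of these parameters might be combined, the hypothetical helper below asks for original files only and skips anything already on disk. It assumes item is an Item obtained from get_item(); the name download_originals is invented for this example.

```python
def download_originals(item, destdir):
    """Download only original files, skipping ones that already exist locally."""
    return item.download(
        source="original",     # filter on the source value in files.xml
        ignore_existing=True,  # skip files that already exist locally
        ignore_errors=True,    # keep going if a single file fails
        destdir=destdir,       # directory to download files to
    )


# Intended use (needs network access, so not executed here):
#     from internetarchive import get_item
#     download_originals(get_item('nasa'), 'downloads')
```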
- fixer(ops: list | str | None = None, priority: int | str | None = None, reduced_priority: bool = False, data: MutableMapping | None = None, headers: Mapping | None = None, request_kwargs: Mapping | None = None) Response [source]#
Submit a fixer task on an item.
- Parameters
ops – The fixer operation(s) to run on the item [default: noop].
priority – The task priority.
reduced_priority – Submit your task at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you still may be subject to rate-limiting. This is different from priority in that it will allow you to possibly avoid rate-limiting.
data – Additional parameters to submit with the task.
- Returns
- get_all_item_tasks(params: dict | None = None, request_kwargs: Mapping | None = None) list[catalog.CatalogTask] [source]#
Get a list of all tasks for the item, pending and complete.
- Parameters
params – Query parameters, refer to Tasks API for available parameters.
request_kwargs – Keyword arguments that requests.get() takes.
- Returns
A list of all tasks for the item, pending and complete.
- get_catalog(params: Mapping | None = None, request_kwargs: Mapping | None = None) list[catalog.CatalogTask] [source]#
Get a list of pending catalog tasks for the item.
- Parameters
params – Params to send with your request.
- Returns
A list of pending catalog tasks for the item.
- get_file(file_name: str, file_metadata: Mapping | None = None) File [source]#
Get a File object for the named file.
- Parameters
file_metadata – a dict of metadata for the given file.
- Returns
An internetarchive.File object.
- get_history(params: Mapping | None = None, request_kwargs: Mapping | None = None) list[catalog.CatalogTask] [source]#
Get a list of completed catalog tasks for the item.
- Parameters
params – Params to send with your request.
- Returns
A list of completed catalog tasks for the item.
- get_task_summary(params: Mapping | None = None, request_kwargs: Mapping | None = None) dict [source]#
Get a summary of the item’s pending tasks.
- Parameters
params – Params to send with your request.
- Returns
A summary of the item’s pending tasks.
- identifier_available() bool [source]#
Check if the item identifier is available for creating a new item.
- Returns
True if identifier is available, or False if it is not available.
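One possible use of this check is to probe a few candidate identifiers before creating an item. The sketch below is an assumption-laden illustration: first_available is a made-up helper, the "-2", "-3" suffix scheme is arbitrary, and get_item is passed in as a parameter so that the function works with internetarchive.get_item or with a stand-in.

```python
def first_available(get_item, base, max_tries=5):
    """Return base, or base-2, base-3, ... whichever is available first, else None."""
    for n in range(1, max_tries + 1):
        candidate = base if n == 1 else f"{base}-{n}"
        # identifier_available() is documented to return True or False.
        if get_item(candidate).identifier_available():
            return candidate
    return None


# Intended use (needs network access, so not executed here):
#     from internetarchive import get_item
#     identifier = first_available(get_item, 'my-new-item')
```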
- modify_metadata(metadata: Mapping, target: str | None = None, append: bool = False, append_list: bool = False, insert: bool = False, priority: int = 0, access_key: str | None = None, secret_key: str | None = None, debug: bool = False, headers: Mapping | None = None, request_kwargs: Mapping | None = None, timeout: int | float | None = None) Request | Response [source]#
Modify the metadata of an existing item on Archive.org.
Note: The Metadata Write API does not yet comply with the latest JSON Patch standard. It currently complies with version 02.
- Parameters
metadata – Metadata used to update the item.
target – Set the metadata target to update.
priority – Set task priority.
append – Append value to an existing multi-value metadata field.
append_list – Append values to an existing multi-value metadata field. No duplicate values will be added.
- Returns
A Request if debug else a Response.
Usage:
>>> import internetarchive
>>> item = internetarchive.get_item('mapi_test_item1')
>>> md = {'new_key': 'new_value', 'foo': ['bar', 'bar2']}
>>> item.modify_metadata(md)
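Because the method is documented to return a Request when debug is set and a Response otherwise, one could build the request first, inspect it, and only then send it. The following is a minimal sketch under that assumption; preview_then_apply and the confirm callback are hypothetical names introduced here.

```python
def preview_then_apply(item, metadata, confirm):
    """Build the request with debug=True first; send it only if confirm() approves."""
    request = item.modify_metadata(metadata, debug=True)  # documented to return a Request
    if not confirm(request):
        return None
    return item.modify_metadata(metadata)  # documented to return a Response


# Intended use (needs credentials and network access, so not executed here):
#     preview_then_apply(item, {'title': 'The Stairs'}, confirm=lambda req: True)
```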
- no_tasks_pending(params: Mapping | None = None, request_kwargs: Mapping | None = None) bool [source]#
Check if there is any pending task for the item.
- Parameters
params – Params to send with your request.
- Returns
True if no tasks are pending, otherwise False.
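A common pattern is to poll this method until the item is idle, for example before modifying metadata after an upload. The helper below is a sketch, not part of the library: wait_until_idle, its default interval, and its retry limit are all assumptions chosen for illustration.

```python
import time


def wait_until_idle(item, poll_seconds=30.0, max_polls=20, sleep=time.sleep):
    """Poll item.no_tasks_pending() until it returns True or max_polls is reached."""
    for attempt in range(max_polls):
        if item.no_tasks_pending():
            return True
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # wait before asking again
    return False
```

The sleep function is a parameter so that the loop can be exercised without actually waiting.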
- remove_from_simplelist(parent, list) requests.models.Response [source]#
Remove item from a simplelist.
- Returns
- undark(comment: str, priority: int | str | None = None, reduced_priority: bool = False, data: Mapping | None = None, request_kwargs: Mapping | None = None) Response [source]#
Undark the item.
- Parameters
comment – The curation comment explaining reason for undarking item
priority – The task priority.
reduced_priority – Submit your task at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you still may be subject to rate-limiting. This is different from priority in that it will allow you to possibly avoid rate-limiting.
data – Additional parameters to submit with the task.
- Returns
- upload(files, metadata: Mapping | None = None, headers: dict | None = None, access_key: str | None = None, secret_key: str | None = None, queue_derive=None, verbose: bool = False, verify: bool = False, checksum: bool = False, delete: bool = False, retries: int | None = None, retries_sleep: int | None = None, debug: bool = False, validate_identifier: bool = False, request_kwargs: dict | None = None, set_scanner: bool = True) list[Request | Response] [source]#
Upload files to an item. The item will be created if it does not exist.
- Parameters
files (str, file, list, tuple, dict) – The filepaths or file-like objects to upload.
**kwargs – Optional arguments that Item.upload_file() takes.
- Returns
A list of requests.Response objects.
Usage:
>>> import internetarchive
>>> item = internetarchive.get_item('identifier')
>>> md = {'mediatype': 'image', 'creator': 'Jake Johnson'}
>>> item.upload('/path/to/image.jpg', metadata=md, queue_derive=False)
[<Response [200]>]
Uploading multiple files:
>>> r = item.upload(['file1.txt', 'file2.txt'])
>>> r = item.upload([fileobj, fileobj2])
>>> r = item.upload(('file1.txt', 'file2.txt'))
Uploading file objects:
>>> import io
>>> f = io.BytesIO(b'some initial binary data: \x00\x01')
>>> r = item.upload({'remote-name.txt': f})
>>> f = io.BytesIO(b'some more binary data: \x00\x01')
>>> f.name = 'remote-name.txt'
>>> r = item.upload(f)
Note: file objects must either have a name attribute, or be uploaded in a dict where the key is the remote name.
Setting the remote filename with a dict:
>>> r = item.upload({'remote-name.txt': '/path/to/local/file.txt'})
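When the data to upload lives in memory, the dict form can be prepared ahead of time. The sketch below builds such a mapping from remote names to io.BytesIO objects; in_memory_files is a hypothetical helper, and the file names and contents are invented.

```python
import io


def in_memory_files(contents):
    """Map remote file names to BytesIO objects built from str or bytes values."""
    payload = {}
    for remote_name, data in contents.items():
        if isinstance(data, str):
            data = data.encode("utf-8")  # BytesIO holds bytes, so encode text first
        payload[remote_name] = io.BytesIO(data)
    return payload


payload = in_memory_files({"notes.txt": "hello", "raw.bin": b"\x00\x01"})
# Intended use (needs credentials and network access, so not executed here):
#     responses = item.upload(payload)
```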
- upload_file(body, key: str | None = None, metadata: Mapping | None = None, file_metadata: Mapping | None = None, headers: dict | None = None, access_key: str | None = None, secret_key: str | None = None, queue_derive: bool = False, verbose: bool = False, verify: bool = False, checksum: bool = False, delete: bool = False, retries: int | None = None, retries_sleep: int | None = None, debug: bool = False, validate_identifier: bool = False, request_kwargs: MutableMapping | None = None, set_scanner: bool = True) Request | Response [source]#
Upload a single file to an item. The item will be created if it does not exist.
- Parameters
body (Filepath or file-like object.) – File or data to be uploaded.
key – Remote filename.
metadata – Metadata used to create a new item.
file_metadata – File-level metadata to add to the files.xml entry for the file being uploaded.
headers – Add additional IA-S3 headers to request.
queue_derive – Set to False to prevent an item from being derived after upload.
verify – Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.
checksum – Skip based on checksum.
delete – Delete local file after the upload has been successfully verified.
retries – Number of times to retry the given request if S3 returns a 503 SlowDown error.
retries_sleep – Amount of time to sleep between retries.
verbose – Print progress to stdout.
debug – Set to True to print headers to stdout, and exit without sending the upload request.
validate_identifier – Set to True to validate the identifier before uploading the file.
Usage:
>>> import internetarchive
>>> item = internetarchive.get_item('identifier')
>>> item.upload_file('/path/to/image.jpg',
...                  key='photos/image1.jpg')
True
internetarchive.File
- class File(item, name, file_metadata=None)[source]#
Bases:
internetarchive.files.BaseFile
This class represents a file in an archive.org item. You can use this class to access the file metadata:
>>> import internetarchive
>>> item = internetarchive.get_item('stairs')
>>> file = internetarchive.File(item, 'stairs.avi')
>>> print(file.format, file.size)
Cinepack 3786730
Or to download a file:
>>> file.download()
>>> file.download('fabulous_movie_of_stairs.avi')
This class also uses IA’s S3-like interface to delete a file from an item. You need to supply your IAS3 credentials in environment variables in order to delete:
>>> file.delete(access_key='Y6oUrAcCEs4sK8ey',
...             secret_key='youRSECRETKEYzZzZ')
You can retrieve S3 keys here: https://archive.org/account/s3.php
- delete(cascade_delete=None, access_key=None, secret_key=None, verbose=None, debug=None, retries=None, headers=None)[source]#
Delete a file from the Archive. Note: Some files – such as <itemname>_meta.xml – cannot be deleted.
- Parameters
cascade_delete (bool) – (optional) Delete all files associated with the specified file, including upstream derivatives and the original.
access_key (str) – (optional) IA-S3 access_key to use when making the given request.
secret_key (str) – (optional) IA-S3 secret_key to use when making the given request.
verbose (bool) – (optional) Print actions to stdout.
debug (bool) – (optional) Set to True to print headers to stdout and exit without sending the delete request.
- download(file_path=None, verbose=None, ignore_existing=None, checksum=None, destdir=None, retries=None, ignore_errors=None, fileobj=None, return_responses=None, no_change_timestamp=None, params=None, chunk_size=None, stdout=None, ors=None, timeout=None)[source]#
Download the file into the current working directory.
- Parameters
file_path (str) – Download file to the given file_path.
verbose (bool) – (optional) Turn on verbose output.
ignore_existing (bool) – (optional) Skip the download if the file already exists locally.
checksum (bool) – (optional) Skip downloading file based on checksum.
destdir (str) – (optional) The directory to download files to.
retries (int) – (optional) The number of times to retry on failed requests.
ignore_errors (bool) – (optional) Don’t fail if a single file fails to download, continue to download other files.
fileobj (file-like object) – (optional) Write data to the given file-like object (e.g. sys.stdout).
return_responses (bool) – (optional) Rather than downloading files to disk, return a list of response objects.
no_change_timestamp (bool) – (optional) If True, leave the time stamp as the current time instead of changing it to that given in the original archive.
stdout (bool) – (optional) Print contents of file to stdout instead of downloading to file.
ors (bool) – (optional) Append a newline or $ORS to the end of file. This is mainly intended to be used internally with stdout.
params (dict) – (optional) URL parameters to send with download request (e.g. cnt=0).
- Return type
- Returns
True if file was successfully downloaded.
internetarchive.Search
- class Search(archive_session, query, fields=None, sorts=None, params=None, full_text_search=None, dsl_fts=None, request_kwargs=None, max_retries=None)[source]#
Bases:
object
This class represents an archive.org item search. You can use this class to search for Archive.org items using the advanced search engine.
Usage:
>>> from internetarchive.session import ArchiveSession
>>> from internetarchive.search import Search
>>> s = ArchiveSession()
>>> search = Search(s, '(uploader:jake@archive.org)')
>>> for result in search:
...     print(result['identifier'])
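Since a search can yield a very large number of results, it is often convenient to stop after the first few. The sketch below assumes only what the usage above shows, namely that iterating yields dict-like results with an 'identifier' key; first_identifiers is a hypothetical helper.

```python
from itertools import islice


def first_identifiers(results, limit=10):
    """Return the 'identifier' value of at most `limit` search results."""
    # islice stops consuming the iterable once `limit` results have been read.
    return [result["identifier"] for result in islice(results, limit)]


# Intended use (needs network access, so not executed here):
#     first_identifiers(Search(s, 'collection:nasa'), limit=5)
```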
internetarchive.Catalog
- class Catalog(archive_session: ia_session.ArchiveSession, request_kwargs: Mapping | None = None)[source]#
Bases:
object
This class represents the Archive.org catalog. You can use this class to access and submit tasks from the catalog.
This is a low-level interface, and in most cases the functions in internetarchive.api and methods in ArchiveSession should be used.
It uses the archive.org Tasks API.
Usage:
>>> from internetarchive import get_session, Catalog
>>> s = get_session()
>>> c = Catalog(s)
>>> tasks = c.get_tasks('nasa')
>>> tasks[-1].task_id
31643502
- get_summary(identifier: str = '', params: dict | None = None) dict [source]#
Get the total counts of catalog tasks meeting all criteria, organized by run status (queued, running, error, and paused).
- Parameters
identifier – Item identifier.
params – Query parameters, refer to Tasks API for available parameters.
- Returns
the total counts of catalog tasks meeting all criteria
- get_tasks(identifier: str = '', params: dict | None = None) list[CatalogTask] [source]#
Get a list of all tasks meeting all criteria. The list is ordered by submission time.
- Parameters
identifier – The item identifier, if provided will return tasks for only this item filtered by other criteria provided in params.
params – Query parameters, refer to Tasks API for available parameters.
- Returns
A list of all tasks meeting all criteria.
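The returned tasks can be post-processed with ordinary Python. As a small sketch, the helper below groups task ids by server; it relies only on the task_id and server attributes that appear in the usage examples on this page, and task_ids_by_server is a hypothetical name.

```python
from collections import defaultdict


def task_ids_by_server(tasks):
    """Group task ids by the server attribute of each task."""
    grouped = defaultdict(list)
    for task in tasks:
        grouped[task.server].append(task.task_id)
    return dict(grouped)


# Intended use (needs network access, so not executed here):
#     task_ids_by_server(c.get_tasks('nasa'))
```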
- iter_tasks(params: MutableMapping | None = None) Iterable[CatalogTask] [source]#
A generator that can make arbitrary requests to the Tasks API. It handles paging (via cursor) automatically.
- Parameters
params – Query parameters, refer to Tasks API for available parameters.
- Returns
collections.Iterable[CatalogTask]
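Cursor-based paging in general can be sketched as a generator that keeps requesting pages until no cursor comes back. This is not the library's implementation: fetch_page is a hypothetical callable, and the page layout (an 'items' list plus an optional 'cursor') is an assumption made for the example rather than the actual Tasks API response format.

```python
def iter_pages(fetch_page, params=None):
    """Yield items from successive pages until no cursor is returned."""
    params = dict(params or {})  # copy so the caller's mapping is not modified
    while True:
        page = fetch_page(params)  # hypothetical callable: params -> dict
        yield from page.get("items", [])
        cursor = page.get("cursor")
        if not cursor:
            break
        params["cursor"] = cursor  # ask for the next page
```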
- make_tasks_request(params: Mapping | None) Response [source]#
Make a GET request to the Tasks API.
- Parameters
params – Query parameters, refer to Tasks API for available parameters.
- Returns
- submit_task(identifier: str, cmd: str, comment: str | None = None, priority: int = 0, data: dict | None = None, headers: dict | None = None) Response [source]#
Submit an archive.org task.
- Parameters
identifier – Item identifier.
cmd – Task command to submit, see supported task commands.
comment – A reasonable explanation for why the task is being submitted.
priority – Task priority from 10 to -10 (default: 0).
data – Extra POST data to submit with the request. Refer to Tasks API Request Entity.
headers – Add additional headers to request.
- Returns
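Since the priority is documented as ranging from 10 to -10, a caller might validate it before submitting a task. The check below is a sketch of such client-side validation; check_priority is a hypothetical helper, and whether the library or server performs a similar check is not stated here.

```python
def check_priority(priority):
    """Return priority unchanged if it lies in the documented -10..10 range."""
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(priority, bool) or not isinstance(priority, int):
        raise TypeError("priority must be an int")
    if not -10 <= priority <= 10:
        raise ValueError(f"priority must be between -10 and 10, got {priority}")
    return priority
```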
internetarchive.ArchiveSession
- class ArchiveSession(config: Mapping | None = None, config_file: str = '', debug: bool = False, http_adapter_kwargs: MutableMapping | None = None)[source]#
Bases:
requests.sessions.Session
The ArchiveSession object collects together useful functionality from internetarchive as well as important data such as configuration information and credentials. It is subclassed from requests.Session.
Usage:
>>> from internetarchive import ArchiveSession
>>> s = ArchiveSession()
>>> item = s.get_item('nasa')
>>> item
Collection(identifier='nasa', exists=True)
- get_item(identifier: str, item_metadata: Mapping | None = None, request_kwargs: MutableMapping | None = None)[source]#
A method for creating internetarchive.Item and internetarchive.Collection objects.
- Parameters
identifier – A globally unique Archive.org identifier.
item_metadata – A metadata dict used to initialize the Item or Collection object. Metadata will automatically be retrieved from Archive.org if nothing is provided.
request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.
- get_metadata(identifier: str, request_kwargs: MutableMapping | None = None)[source]#
Get an item’s metadata from the Metadata API
- Parameters
identifier – Globally unique Archive.org identifier.
- Returns
Metadata API response.
- get_my_catalog(params: dict | None = None, request_kwargs: Mapping | None = None) set[catalog.CatalogTask] [source]#
Get all queued or running tasks.
- Parameters
params – Query parameters, refer to Tasks API for available parameters.
request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.
- Returns
A set of all queued or running tasks.
- get_task_log(task_id: str | int, request_kwargs: Mapping | None = None) str [source]#
Get a task log.
- Parameters
task_id – The task id for the task log you’d like to fetch.
request_kwargs – Keyword arguments that requests.Request takes.
- Returns
The task log as a string.
- get_tasks(identifier: str = '', params: dict | None = None, request_kwargs: Mapping | None = None) set[catalog.CatalogTask] [source]#
Get a list of all tasks meeting all criteria. The list is ordered by submission time.
- Parameters
identifier – The item identifier, if provided will return tasks for only this item filtered by other criteria provided in params.
params – Query parameters, refer to Tasks API for available parameters.
request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.
- Returns
A set of all tasks meeting all criteria.
- get_tasks_summary(identifier: str = '', params: dict | None = None, request_kwargs: Mapping | None = None) dict [source]#
Get the total counts of catalog tasks meeting all criteria, organized by run status (queued, running, error, and paused).
- Parameters
identifier – Item identifier.
params – Query parameters, refer to Tasks API for available parameters.
request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.
- Returns
Counts of catalog tasks meeting all criteria.
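To show how such a summary might be consumed, the sketch below normalizes the four run statuses named above and adds a total. It assumes the returned dict maps each status name directly to a count, which is not confirmed by this page; summarize is a hypothetical helper.

```python
def summarize(summary):
    """Return counts for the four run statuses plus their total."""
    statuses = ("queued", "running", "error", "paused")
    # Missing statuses are treated as zero (an assumption of this sketch).
    counts = {status: int(summary.get(status, 0)) for status in statuses}
    counts["total"] = sum(counts.values())
    return counts
```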
- iter_catalog(identifier: str | None = None, params: dict | None = None, request_kwargs: Mapping | None = None) Iterable[catalog.CatalogTask] [source]#
A generator that returns queued or running tasks.
- Parameters
identifier – Item identifier.
params – Query parameters, refer to Tasks API for available parameters.
request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.
- Returns
An iterable of queued or running CatalogTasks.
- iter_history(identifier: str | None, params: dict | None = None, request_kwargs: Mapping | None = None) Iterable[catalog.CatalogTask] [source]#
A generator that returns completed tasks.
- Parameters
identifier – Item identifier.
params – Query parameters, refer to Tasks API for available parameters.
request_kwargs – Keyword arguments to be used in requests.sessions.Session.get() request.
- Returns
An iterable of completed CatalogTasks.
- mount_http_adapter(protocol: str | None = None, max_retries: int | None = None, status_forcelist: list | None = None, host: str | None = None) None [source]#
Mount an HTTP adapter to the ArchiveSession object.
- Parameters
protocol – HTTP protocol to mount your adapter to (e.g. ‘https://’).
max_retries – The number of times to retry a failed request. This can also be a urllib3.Retry object.
status_forcelist – A list of status codes (as ints) to retry on.
host – The host to mount your adapter to.
- search_items(query: str, fields: Iterable[str] | None = None, sorts: Iterable[str] | None = None, params: Mapping | None = None, full_text_search: bool = False, dsl_fts: bool = False, request_kwargs: Mapping | None = None, max_retries: int | Retry | None = None) Search [source]#
Search for items on Archive.org.
- Parameters
query – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.
fields – The metadata fields to return in the search results.
params – The URL parameters to send with each request sent to the Archive.org Advanced Search API.
full_text_search – Beta support for querying the archive.org Full Text Search API [default: False].
dsl_fts – Beta support for querying the archive.org Full Text Search API in dsl (i.e. do not prepend “!L ” to the full_text_search query) [default: False].
- Returns
A Search object, yielding search results.
- send(request, **kwargs) requests.models.Response [source]#
Send a given PreparedRequest.
- Return type
- set_file_logger(log_level: str, path: str, logger_name: str = 'internetarchive') None [source]#
Convenience function to quickly configure any level of logging to a file.
- Parameters
log_level – A log level as specified in the logging module.
path – Path to the log file. The file will be created if it doesn’t already exist.
logger_name – The name of the logger.
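For readers unfamiliar with the logging module, the following sketch shows roughly what configuring a file logger involves using only the standard library. It is not the library's implementation; the function name file_logger and the log format are assumptions made for this example.

```python
import logging


def file_logger(log_level, path, logger_name="internetarchive"):
    """Attach a FileHandler at the given level to the named logger and return it."""
    logger = logging.getLogger(logger_name)
    logger.setLevel(log_level)           # accepts level names such as 'INFO'
    handler = logging.FileHandler(path)  # creates the file if it is missing
    handler.setLevel(log_level)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

Calling the function twice for the same logger name would attach two handlers, so a real setup would normally call it once at startup.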
- submit_task(identifier: str, cmd: str, comment: str = '', priority: int = 0, data: dict | None = None, headers: dict | None = None, reduced_priority: bool = False, request_kwargs: Mapping | None = None) requests.Response [source]#
Submit an archive.org task.
- Parameters
identifier – Item identifier.
cmd – Task command to submit, see supported task commands.
comment – A reasonable explanation for why the task is being submitted.
priority – Task priority from 10 to -10 (default: 0).
data – Extra POST data to submit with the request. Refer to Tasks API Request Entity.
headers – Add additional headers to request.
reduced_priority – Submit your task at a lower priority. This option is helpful to get around rate-limiting. Your task will more likely be accepted, but it might not run for a long time. Note that you still may be subject to rate-limiting. This is different from priority in that it will allow you to possibly avoid rate-limiting.
request_kwargs – Keyword arguments to be used in requests.sessions.Session.post() request.
- Returns
internetarchive.api
This module implements the Internetarchive API.
- copyright
2012-2019 by Internet Archive.
- license
AGPL 3, see LICENSE for more details.
- configure(username: str = '', password: str = '', config_file: str = '', host: str = 'archive.org') str [source]#
Configure internetarchive with your Archive.org credentials.
- Parameters
username – The email address associated with your Archive.org account.
password – Your Archive.org password.
- Returns
The config file path.
- Usage:
>>> from internetarchive import configure
>>> configure('user@example.com', 'password')
- delete(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, cascade_delete: bool = False, access_key: str | None = None, secret_key: str | None = None, verbose: bool = False, debug: bool = False, **kwargs) list[requests.Request | requests.Response] [source]#
Delete files from an item. Note: Some system files, such as <itemname>_meta.xml, cannot be deleted.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
files – Only delete files matching the given filenames.
formats – Only delete files matching the given formats.
glob_pattern – Only delete files matching the given glob pattern.
cascade_delete – Delete all files associated with the specified file, including upstream derivatives and the original.
access_key – IA-S3 access_key to use when making the given request.
secret_key – IA-S3 secret_key to use when making the given request.
verbose – Print actions to stdout.
debug – Set to True to print headers to stdout and exit without sending the delete request.
- Returns
A list of Requests if debug, else a list of Responses.
- download(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, dry_run: bool = False, verbose: bool = False, ignore_existing: bool = False, checksum: bool = False, destdir: str | None = None, no_directory: bool = False, retries: int | None = None, item_index: int | None = None, ignore_errors: bool = False, on_the_fly: bool = False, return_responses: bool = False, no_change_timestamp: bool = False, timeout: int | float | tuple[int, float] | None = None, **get_item_kwargs) list[requests.Request | requests.Response] [source]#
Download files from an item.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
files – Only download files matching the given file names.
formats – Only download files matching the given formats.
glob_pattern – Only download files matching the given glob pattern.
dry_run – Print URLs to files to stdout rather than downloading them.
verbose – Turn on verbose output.
ignore_existing – Skip files that already exist locally.
checksum – Skip downloading file based on checksum.
destdir – The directory to download files to.
no_directory – Download files to current working directory rather than creating an item directory.
retries – The number of times to retry on failed requests.
item_index – The index of the item for displaying progress in bulk downloads.
ignore_errors – Don’t fail if a single file fails to download, continue to download other files.
on_the_fly – Download on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
return_responses – Rather than downloading files to disk, return a list of response objects.
**get_item_kwargs – Optional arguments that get_item() takes.
- Returns
A list of Requests if debug, else a list of Responses.
- get_files(identifier: str, files: files.File | list[files.File] | None = None, formats: str | list[str] | None = None, glob_pattern: str | None = None, exclude_pattern: str | None = None, on_the_fly: bool = False, **get_item_kwargs) list[files.File] [source]#
Get File objects from an item.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
files – Only return files matching the given filenames.
formats – Only return files matching the given formats.
glob_pattern – Only return files matching the given glob pattern.
exclude_pattern – Exclude files matching the given glob pattern.
on_the_fly – Include on-the-fly files (i.e. derivative EPUB, MOBI, DAISY files).
**get_item_kwargs – Arguments that get_item() takes.
- Returns
Files from an item.
- Usage:
>>> from internetarchive import get_files
>>> fnames = [f.name for f in get_files('nasa', glob_pattern='*xml')]
>>> print(fnames)
['nasa_reviews.xml', 'nasa_meta.xml', 'nasa_files.xml']
- get_item(identifier: str, config: Mapping | None = None, config_file: str | None = None, archive_session: session.ArchiveSession | None = None, debug: bool = False, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: MutableMapping | None = None) item.Item [source]#
Get an Item object.
- Parameters
identifier – The globally unique Archive.org item identifier.
config – A dictionary used to configure your session.
config_file – A path to a config file used to configure your session.
archive_session – An ArchiveSession object can be provided via the archive_session parameter.
debug – To be passed on to get_session().
http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.
request_kwargs – Keyword arguments that requests.Request takes.
- Returns
The Item that fits the criteria.
- Usage:
>>> from internetarchive import get_item
>>> item = get_item('nasa')
>>> item.item_size
121084
- get_session(config: Mapping | None = None, config_file: str | None = None, debug: bool = False, http_adapter_kwargs: MutableMapping | None = None) session.ArchiveSession [source]#
Return a new ArchiveSession object. The ArchiveSession object is the main interface to the internetarchive lib. It allows you to persist certain parameters across tasks.
- Parameters
config – A dictionary used to configure your session.
config_file – A path to a config file used to configure your session.
debug – To be passed on to this session’s method calls.
http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.
- Returns
A new ArchiveSession object, which persists certain parameters across tasks.
Usage:
>>> from internetarchive import get_session
>>> config = {'s3': {'access': 'foo', 'secret': 'bar'}}
>>> s = get_session(config)
>>> s.access_key
'foo'
From the session object, you can access all of the functionality of the internetarchive lib:
>>> item = s.get_item('nasa')
>>> item.download()
nasa: ddddddd - success
>>> s.get_tasks(task_ids=31643513)[0].server
'ia311234'
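The config dictionary passed to get_session() can be assembled from environment variables rather than hard-coded. A minimal sketch follows. The variable names IA_ACCESS_KEY and IA_SECRET_KEY and the helpers build_config and open_session are assumptions made for this example; only the {'s3': {'access': ..., 'secret': ...}} layout comes from the usage above.

```python
import os


def build_config(env=None):
    """Return a config dict in the {'s3': {'access', 'secret'}} layout.

    IA_ACCESS_KEY and IA_SECRET_KEY are hypothetical variable names chosen
    for this sketch; they are not defined by the library.
    """
    env = os.environ if env is None else env
    try:
        return {"s3": {"access": env["IA_ACCESS_KEY"],
                       "secret": env["IA_SECRET_KEY"]}}
    except KeyError as missing:
        raise RuntimeError(f"missing credential variable: {missing}") from None


def open_session(env=None):
    """Create an ArchiveSession from the environment (needs internetarchive)."""
    from internetarchive import get_session  # imported lazily for the sketch
    return get_session(config=build_config(env))


# Exercise the pure part with an explicit mapping instead of os.environ.
example = build_config({"IA_ACCESS_KEY": "foo", "IA_SECRET_KEY": "bar"})
```

Keeping the dictionary construction separate from the get_session() call means the credential handling can be checked without the library installed.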
- get_tasks(identifier: str = '', params: dict | None = None, config: Mapping | None = None, config_file: str | None = None, archive_session: session.ArchiveSession | None = None, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: MutableMapping | None = None) set[catalog.CatalogTask] [source]#
Get tasks from the Archive.org catalog.
- Parameters
identifier – The Archive.org identifier for which to retrieve tasks.
params – The URL parameters to send with each request sent to the Archive.org catalog API.
- Returns
A set of CatalogTask objects.
- get_user_info(access_key: str, secret_key: str) dict[str, str] [source]#
Returns details about an Archive.org user given an IA-S3 key pair.
- Parameters
access_key – IA-S3 access_key to use when making the given request.
secret_key – IA-S3 secret_key to use when making the given request.
- Returns
Archive.org user info.
- get_username(access_key: str, secret_key: str) str [source]#
Returns an Archive.org username given an IA-S3 key pair.
- Parameters
access_key – IA-S3 access_key to use when making the given request.
secret_key – IA-S3 secret_key to use when making the given request.
- Returns
The username.
- modify_metadata(identifier: str, metadata: Mapping, target: str | None = None, append: bool = False, append_list: bool = False, priority: int = 0, access_key: str | None = None, secret_key: str | None = None, debug: bool = False, request_kwargs: Mapping | None = None, **get_item_kwargs) requests.Request | requests.Response [source]#
Modify the metadata of an existing item on Archive.org.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
metadata – Metadata used to update the item.
target – The metadata target to update. Defaults to metadata.
append – Set to True to append metadata values to current values rather than replacing. Defaults to False.
append_list – Append values to an existing multi-value metadata field. No duplicate values will be added.
priority – Set task priority.
access_key – IA-S3 access_key to use when making the given request.
secret_key – IA-S3 secret_key to use when making the given request.
debug – Set to True to return a requests.Request object instead of sending the request. Defaults to False.
**get_item_kwargs – Arguments that get_item takes.
- Returns
A Request if debug else a Response.
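The append_list behavior described above (add values to a multi-value field, skipping duplicates) can be pictured with plain Python data. This is a sketch of the documented outcome only: merge_list_field is a hypothetical helper, not the library's implementation, and the resulting order is an assumption.

```python
def merge_list_field(current, new_values):
    """Append new_values to a multi-value field without adding duplicates.

    Hypothetical helper: it models the documented result of append_list=True,
    not the code the library actually runs.
    """
    # A single-valued field may be a plain string; treat it as a one-item list.
    merged = [current] if isinstance(current, str) else list(current or [])
    if isinstance(new_values, str):
        new_values = [new_values]
    for value in new_values:
        if value not in merged:
            merged.append(value)  # only values not already present are added
    return merged


# 'nasa' is already present, so only 'apollo' is appended.
subjects = merge_list_field(["space", "nasa"], ["nasa", "apollo"])
```

Against a real item, the corresponding call would presumably look like modify_metadata(identifier, {'subject': ['apollo']}, append_list=True), with subject used here purely as an example field.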
- search_items(query: str, fields: Iterable | None = None, sorts=None, params: Mapping | None = None, full_text_search: bool = False, dsl_fts: bool = False, archive_session: session.ArchiveSession | None = None, config: Mapping | None = None, config_file: str | None = None, http_adapter_kwargs: MutableMapping | None = None, request_kwargs: Mapping | None = None, max_retries: int | Retry | None = None) search.Search [source]#
Search for items on Archive.org.
- Parameters
query – The Archive.org search query to yield results for. Refer to https://archive.org/advancedsearch.php#raw for help formatting your query.
fields – The metadata fields to return in the search results.
params – The URL parameters to send with each request sent to the Archive.org Advancedsearch API.
full_text_search – Beta support for querying the archive.org Full Text Search API [default: False].
dsl_fts – Beta support for querying the archive.org Full Text Search API in dsl (i.e. do not prepend "!L " to the full_text_search query) [default: False].
secure – Configuration options for session.
config_file – A path to a config file used to configure your session.
http_adapter_kwargs – Keyword arguments that requests.adapters.HTTPAdapter takes.
request_kwargs – Keyword arguments that requests.Request takes.
max_retries – The number of times to retry a failed request. This can also be a urllib3.Retry object. If you need more control (e.g. status_forcelist), use an ArchiveSession object, and mount your own adapter after the session object has been initialized. For example:
>>> s = get_session()
>>> s.mount_http_adapter()
>>> search_results = s.search_items('nasa')
See ArchiveSession.mount_http_adapter() for more details.
- Returns
A Search object, yielding search results.
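Because the returned Search object yields results as it is iterated, a caller can stop early instead of draining the whole result set. The sketch below assumes each result is a dict containing an identifier key. The names first_identifiers and demo are hypothetical, and demo needs the internetarchive package and network access, so it is defined but not called here.

```python
from itertools import islice


def first_identifiers(results, limit):
    """Return up to `limit` identifiers from an iterable of result dicts."""
    return [r["identifier"] for r in islice(results, limit)]


def demo():
    """Run a live search; requires internetarchive and network access."""
    from internetarchive import search_items  # lazy import for the sketch
    results = search_items("collection:nasa", fields=["identifier"])
    return first_identifiers(results, 5)


# Offline stand-in for a Search object: any iterable of dicts works.
fake_results = iter([{"identifier": "a"}, {"identifier": "b"}, {"identifier": "c"}])
picked = first_identifiers(fake_results, 2)
```

Using islice keeps the iteration lazy, so results beyond the limit are never consumed.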
- upload(identifier: str, files, metadata: Mapping | None = None, headers: dict | None = None, access_key: str | None = None, secret_key: str | None = None, queue_derive=None, verbose: bool = False, verify: bool = False, checksum: bool = False, delete: bool = False, retries: int | None = None, retries_sleep: int | None = None, debug: bool = False, validate_identifier: bool = False, request_kwargs: dict | None = None, **get_item_kwargs) list[requests.Request | requests.Response] [source]#
Upload files to an item. The item will be created if it does not exist.
- Parameters
identifier – The globally unique Archive.org identifier for a given item.
files – The filepaths or file-like objects to upload. This value can be an iterable or a single file-like object or string.
metadata – Metadata used to create a new item. If the item already exists, the metadata will not be updated; use modify_metadata instead.
headers – Add additional HTTP headers to the request.
access_key – IA-S3 access_key to use when making the given request.
secret_key – IA-S3 secret_key to use when making the given request.
queue_derive – Set to False to prevent an item from being derived after upload.
verbose – Display upload progress.
verify – Verify local MD5 checksum matches the MD5 checksum of the file received by IAS3.
checksum – Skip uploading files based on checksum.
delete – Delete local file after the upload has been successfully verified.
retries – Number of times to retry the given request if S3 returns a 503 SlowDown error.
retries_sleep – Amount of time to sleep between retries.
debug – Set to True to print headers to stdout, and exit without sending the upload request.
validate_identifier – Set to True to validate the identifier before uploading the file.
**get_item_kwargs – Optional arguments that get_item takes.
- Returns
A list of Requests if debug, else a list of Responses.
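The retries and retries_sleep parameters describe a retry-with-pause loop around 503 SlowDown responses. A generic sketch of that pattern follows. It is not the library's upload code: send_with_retries and the send callable are hypothetical stand-ins, and counting the first attempt separately from the retries is an assumption.

```python
import time


def send_with_retries(send, retries=3, retries_sleep=0, sleep=time.sleep):
    """Call send() until it returns a non-503 status or retries run out.

    `send` is a hypothetical zero-argument callable returning an HTTP status
    code; `sleep` is injectable so the sketch can be exercised without waiting.
    Returns a (status, attempts) tuple.
    """
    attempts = 0
    while True:
        status = send()
        attempts += 1
        if status != 503 or attempts > retries:
            return status, attempts
        sleep(retries_sleep)  # pause before the next attempt


# Simulated server: two SlowDown responses, then success.
responses = iter([503, 503, 200])
status, attempts = send_with_retries(lambda: next(responses), retries=3,
                                     retries_sleep=30, sleep=lambda s: None)
```

In a real upload the pause would actually be slept; the no-op sleep here only keeps the simulation instant.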